Site Reliability Engineering: How Google Runs Production Systems

  • Create Date:2021-03-09 03:17:20
  • Update Date:2025-09-06
  • Author:Betsy Beyer
  • ISBN:B01DCPXKZ6

Summary

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons that are directly applicable to your organization.

This book is divided into four sections:

  • Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
  • Management: Explore Google's best practices for training, communication, and meetings that your organization can use

Reviews

Benjamin Romano

This book provides a good overview of how to maintain large-scale production systems for an engineer new to this area. For engineers already familiar with these concepts, there are great insights into scaling operational excellence to a broader engineering organization. However, for those in the latter camp, it can be difficult to make it through to those interesting sections. The book is a collection of essays, each covering a specific topic, which unfortunately leads to a large amount of recapping of ideas discussed in another essay. Ultimately, I'd recommend this as a reference instead of reading end to end.

Alex Bulankou

Back when I was starting to work with software, every project seemed like a dangerous, risky adventure with expected casualties in the way of burnout, overtime work, all-nighters, incidents where all data and source code get destroyed, and death marches, with a huge upside in case it is a success. No one in their right mind would want to do it past their 20s. When reading this book, it dawned on me how much the field has matured and become sustainably boring, and it starts feeling like something an engineer would not mind at all doing all their life, 9 to 5 till retirement.

Javed Nissar

A bit dry but useful

Vytas

Good reference manual for SREs. Picked up a few management approaches which I will try at work.

Sandhya Chandramohan

One of my favorite quotes is by Hiroshi Mikitani, CEO of Rakuten, who said that 'Everything breaks at multiples of 3 and powers of 10'. This was said more in relation to fast-growing companies, but I always think about software when I think of this quote. Building software that works for 10,000 or even 100,000 people is easy; anyone can do it with some crappy code. But building it for 10 million or a billion people is insanely tough. Building software that scales is the real challenge. And in today's world there are very few companies where you get to work on that kind of scale outside of the big 5 FAANG. Google is the industry leader in the space of running production systems at scale and pioneered the role of Site Reliability Engineers (an amalgam of software and systems engineers). This book lays out the philosophy behind an SRE: the tools, practices and mindset. It provides an inside look at how Google runs its production systems and ensures performance and reliability at scale. An SRE's day-to-day activity is building and operating large distributed computing systems. As a Production Engineer at Facebook (the counterpart of a Google SRE), I found this book, despite being outdated now, a great introduction to an SRE's life and thought process.

Harald Blikø

A treasure trove of information, but it’s messy and the information is hard to extract. There are many books that explain the concepts and lessons first described in this book better and more simply. If you are new to DevOps and SRE, don’t start with this book. If you are familiar with the topics, somewhat experienced, and want to get to the source or want to use it as a technical reference, this is a book for you. If not, you should probably find a book that explains the concepts more pedagogically.

Chris

A book of insight into Google's SRE approach. It's written by Googlers and has been designed to be read in one go or as a reference. I've read most of it in depth. I suspect I will be referring back to it at some point.

Samiur Khan

Pretty good coverage of practices. No extra padding. Just straight to the point on how to run systems well. Docked a star though because I wish they went over alternatives to their recommendations and presented more data for why they went the way they chose. Also wish they touched more on how to convince an organization to follow through with SRE recommendations (where the aforementioned data would come in handy). Recommendations are pretty clear for how to run an SRE org but the very creation of the org might be more difficult.

Sambridi

PS: This isn't really a review. It is a note for myself. I read 5/6 chapters. But participated in team-wide discussions on this book, so have a sense of what more is in store. I definitely want to go back and finish it at some point.

Rhys Powell

I'm guessing that to get the real SRE, as defined by those who created it, you need to read this book. It does give some good info, process and ideas. Sad thing is it's very dry, making a reasonably sized book feel 10 times bigger. My approach would be to just do a chapter, then let that settle in for a while, as opposed to reading it last thing at night. Can't complain too much, I often got to sleep far quicker than normal.

Elie De Brauwer

I enjoyed reading this book. It really helps explain the 'SRE' roles and responsibilities, where in today's 'ops' and 'devops' world you very often see this term abused as something in the middle, while it is in fact a completely different domain (which is often completely lacking). The book consists of chapters written by different authors, each focusing on a specific topic. The disadvantage of this is that the book feels like a collection of papers (though the editor did some really good work attempting to make it structured and homogeneous), and the result is that there are some really good, interesting chapters and some that are just a bit less interesting.

João Quitério

This is more a collection of related articles than a book. Some chapters are awesome, others not so much, but there were a lot of valuable concepts and insights (SLOs, error budgets, their approach to incident management or post-mortems) that I've found very valuable when talking to others on these topics.

Rafael Remondes

This book contains some very good advice and interesting lessons that we can learn when reading about how Google runs its own production systems. I think some examples and recommendations are really useful; however, they only apply to big teams or companies with the resources and manpower to apply them. A small part of them even apply only to Google.

Aleksandar

I learned a lot from this book. However, the style in which it is written makes it really hard to read the book and make progress.

Karl Fischer

A must-read for SREs. Very insightful lessons and approaches from probably the biggest and most complex distributed system in the world. The book is based on experiences from different people at Google, which is why some chapters have overlapping content. Nevertheless, the content is great and it doesn't hurt to read some parts twice :)

Mohammed Sibghatullah Ayubi

You will come back to it more than once in your career. Keep it handy.

Niharika

It's hard to give this book a star rating - there were some essays that I really enjoyed and learned from, while others I found dry. Perhaps my only recommendation regarding this book would be to not read it cover to cover, and instead, hone in on the parts of Site Reliability Engineering that interest you, and read just those essays. Somewhat tangential, and likely due to recency bias, but one of the essays I liked the most was one of the last ones, in which the authors interviewed a host of Google software engineers who had backgrounds in industries that also cared heavily about reliability - think air traffic control, working on the 911 system, nuclear power plant engineers, and lifeguards. The essay compared and contrasted those industries' definitions and standards for reliability with the Google SRE teams', and while there were quite a lot of similarities, there were also notable differences which I found interesting.

Mengyi

This is a complete collection of everything about building the SRE team, from their practices to how to onboard a new SRE to the team. I am personally really inspired by the concept of the error budget and the share-by-default culture fostered by practices such as blameless postmortems.

Yixing J

Pretty dense and informative book about how SRE is run at Google. It is much more related to Internet companies than to my current industry (AI robotics), and is very centered around building an SRE team and the team's culture and processes as well. It could be helpful for bigger companies that are in need of this.

Danny Gibson

Finally finished after abandoning in the midst of some Qualitrics eng reading group. This book can be a slog at times, but there are gems that make it a solid resource to a group developing any operations team. The Google case studies were pretty exciting to read, as well.

Ilya

Too boring to finish

Bartosz Pranczke

This is a good book, especially if you want to broaden your tool belt for reliability issues. Most topics are not covered in enough depth to be of much practical value, but they give the starting point and the "Google" solution. You will know how to find some kind of a middle ground, given that you are not Google, so their solution is probably overblown for most. For me, knowing what's possible is helpful when learning about a new domain of knowledge. The book also shows how broad the subject of "the application works on production" is, which may be excellent learning material for any developer, not only to learn about that but also to have an appreciation for the people who take care of it. A lot of this book is also just fun to read because there are plenty of details about how Google works. This is inspiring stuff :)

Oleksandr Bilyk

Good book. Highly recommended for everyone who wants to make their services reliable. It was a great honour for me to sit next to the authors of this book. The SRE conference was great.

Wojtek Pietrucha

Very interesting book. It shows many aspects of software development. It can also be used as a handbook on how to handle specific topics.

Amr

The book is great in terms of getting more understanding of Google’s SRE culture. But I got to a place where it became irrelevant to me to continue the book, so I decided to drop it.

Karol

Interesting tour of Google's ops landscape. The book is a patchwork written by different people, with slightly different goals in mind. However, I think one can learn a lot about common challenges of ops: which way to evolve, how to retain people, what scaling should mean, what to do with interruptions, where and how ops should fit into delivery, ... It is, of course, Google's view on the topic, but it is Google's book after all. Yes, there are some parts strictly related to Google's infrastructure, scale and context that are not very useful outside. This was a reason I put it away for some time before picking it up again. But we need to remember that employer branding was probably also an important reason to publish it. Anyway, the book is written quite well. It is not bloated, sentences are well thought out and informative, and it doesn't make you feel like you are being taught. It makes a good impression. Having the circumstances in mind, it is not easy to imagine it being written somehow differently.

Ray

After almost a year I've finally finished reading this book! Unfortunately, when I started reading it I was using the Apple Books app, which isn't good for highlighting PDFs. Much later I switched to the Documents app by Readdle, and it works much better. Overall, this is a great book that I'll be coming back to in the future. Some of the stuff was a little hard to imagine because they're talking about Google scale, but for the most part their ideas were applicable to what I see in SWE. From the time when I did have the Documents app, here are some notes:

  • Always use randomized exponential backoff when scheduling retries. This helps avoid getting a lot of requests all at the same time when something is failing.
  • Data integrity means that services in the cloud remain accessible to users.
  • The most important difference between backups and archives is that backups can be loaded back into an application, while archives cannot.
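
The first of the notes above, randomized exponential backoff, can be sketched in a few lines of Python. This is a minimal illustration rather than code from the book: the names `backoff_delay` and `call_with_retries`, the default `base` and `cap` values, and the choice of "full jitter" (a uniformly random delay between zero and the exponential ceiling) are all assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Return a randomized delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt (base * 2**attempt) and is
    limited to `cap`; the delay is a random fraction of that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait a randomized, exponentially growing time before retrying.
            sleep(backoff_delay(attempt, base, cap))
```

Because each client draws its own random delay, clients that failed at the same moment are unlikely to retry at the same moment, which spreads the retry load instead of concentrating it on a service that is already struggling.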

Joe Lachoff

Boring technical information, read this book for work, it drained my soul of pleasure and replaced it with stale dirty motor oil reeking of Google's relentless pursuit to squeeze every last penny from the filthy technocracy they have created. I did learn something, but nothing that a well-written Wikipedia page couldn't have provided.

Michał Niczyporuk

Great primer on SRE and how to scale engineering organizations from an operations perspective. Must read.

Suraj

Great book, one of the best books on the topic. If you are running production systems, you should give this book a read. This book is available online (free): https://landing.google.com/sre/books/